Skip to content

Add CommentStrategy to optionally preserve leading comments#53

Merged
JanTvrdik merged 10 commits into
mainfrom
mysql-preserve-leading-comments
Jun 1, 2026
Merged

Add CommentStrategy to optionally preserve leading comments#53
JanTvrdik merged 10 commits into
mainfrom
mysql-preserve-leading-comments

Conversation

@JanTvrdik
Copy link
Copy Markdown
Member

@JanTvrdik JanTvrdik commented May 29, 2026

Summary

Adds the ability to retain SQL comments that precede a query, via a pluggable CommentStrategy injected into the parser constructor. By default comments are still stripped, so existing behavior — and the public Iterator<string> return type — are unchanged.

This supersedes the original preserveLeadingComments: bool flag (see discussion below): rather than bake an opinionated comment-handling policy into the parser, the policy is now a swappable strategy and the parser core stays neutral.

Motivation

The SQL that runs is visible in observability tools, and it's a common pattern to prepend a comment to a query so it shows up there. Prepending (not appending) matters because the full query is often truncated in those tools:

-- why this risky alter is safe to run
ALTER something_complicated;

Previously the parser unconditionally stripped such comments.

Design

The parser now tokenizes input into a neutral stream of Query and Comment fragments; a CommentStrategy collapses that stream into the final query strings:

interface CommentStrategy
{
    /**
     * @param  Iterator<Fragment> $fragments
     * @return Iterator<string>
     */
    public function apply(Iterator $fragments): Iterator;
}

Two strategies are bundled:

  • DropComments (default) — drops every comment; reproduces the historical output byte-for-byte.
  • PrependLeadingComments — prepends the comments preceding a query as a prefix of that query.
$parser = new MySqlMultiQueryParser(new PrependLeadingComments());

foreach ($parser->parseString("-- create the users table\nCREATE TABLE users (id INT);") as $query) {
    echo $query; // "-- create the users table\nCREATE TABLE users (id INT)"
}

A user wanting a different policy (blank-line separators, trailing-comment handling, …) implements CommentStrategy themselves; the non-standardized "a comment belongs to the following query" assumption lives only in PrependLeadingComments, not in the parser.

Behavior (PrependLeadingComments)

  • All comment styles of the dialect (--, /* */, and # for MySQL) directly preceding a query are preserved with their original formatting; only pure leading whitespace is stripped.
  • A comment between two queries attaches to the following one.
  • A comment with no following query (a trailing comment at end of input, or one preceding a MySQL DELIMITER directive) is faithfully emitted as a Comment fragment, but dropped by both bundled strategies since there is no query to attach it to.

BC

  • Public return type stays Iterator<string>; IMultiQueryParser is unchanged.
  • Default DropComments ⇒ no behavior change.
  • Still targets PHP 8.0 (no readonly / new in default args).

Tests

  • Dialect-agnostic comment cases run across all four parsers; MySQL keeps its #- and DELIMITER-specific cases.
  • Fragment-level tests assert the parser faithfully emits Comment fragments even when no query follows (trailing comment, comment-before-DELIMITER).
  • Streaming chunk-boundary safety verified exhaustively; phpstan level 8 + strict-rules clean.

Co-Authored-By: Claude Code

JanTvrdik added 2 commits May 29, 2026 13:51
By default comments preceding a query are stripped. The new opt-in
constructor flag `preserveLeadingComments` keeps them as a prefix of
the yielded query string instead, which is useful when comments carry
meaningful annotations.

The leading "skip" zone of the query regex is split into a part that
strips pure leading whitespace and a captured `leadingComments` group
that collects --, # and /* */ comments (with their original formatting)
directly preceding a query. A comment between two queries is treated as
preceding the following one; comments not followed by any query are
dropped. Default behavior is unchanged.

Co-Authored-By: Claude Code
Move the `preserveLeadingComments` constructor flag and the comment
prepending logic into BaseMultiQueryParser, and apply the same regex
restructure (strip pure leading whitespace, then capture preceding
comments into a `leadingComments` group) to the PostgreSQL, SQL Server
and SQLite parsers.

The shared, dialect-agnostic behavior is now tested once in
MultiQueryParserTestCase (so every parser runs it), including a
two-chunk-boundary streaming test; MySQL keeps its # hash-comment
specific cases. Default behavior is unchanged for all parsers.

Co-Authored-By: Claude Code
@JanTvrdik JanTvrdik changed the title Add preserveLeadingComments option to MySQL parser Add preserveLeadingComments option to all parsers May 29, 2026
JanTvrdik added 2 commits May 29, 2026 14:09
It is read only inside BaseMultiQueryParser::buildQuery(); no subclass
accesses it, so `protected` advertised an extension point that does not
exist. Tighten the boundary to `private`.

Co-Authored-By: Claude Code
PatternIterator declared its yielded match as array<mixed>, but
preg_match results (no PREG_OFFSET_CAPTURE / PREG_UNMATCHED_AS_NULL)
are always string-valued. Declaring array<array-key, string> reflects
that guarantee and lets buildQuery() return/concatenate the match
values directly, removing the (string) casts.

Co-Authored-By: Claude Code
@JanTvrdik JanTvrdik marked this pull request as ready for review May 29, 2026 12:15
Copilot AI review requested due to automatic review settings May 29, 2026 12:15
The behavior is documented in the readme; the typed, self-describing
parameter does not need a duplicate prose comment.

Co-Authored-By: Claude Code
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in preserveLeadingComments parser option so leading SQL comments can be retained as part of yielded query strings while preserving existing default behavior.

Changes:

  • Adds shared constructor state and buildQuery() logic in BaseMultiQueryParser.
  • Updates all dialect parsers to capture leading comments and prepend them when enabled.
  • Adds shared and MySQL-specific tests plus README documentation for the new option.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/BaseMultiQueryParser.php Adds the shared option and query-building helper.
src/MySqlMultiQueryParser.php Captures leading MySQL comments, including #, and uses shared query building.
src/PostgreSqlMultiQueryParser.php Captures leading PostgreSQL comments and uses shared query building.
src/SqlServerMultiQueryParser.php Captures leading SQL Server comments and uses shared query building.
src/SqliteMultiQueryParser.php Captures leading SQLite skip/comment regions and uses shared query building.
src/PatternIterator.php Tightens iterator match type documentation.
tests/inc/MultiQueryParserTestCase.php Adds shared preservation behavior and chunk-boundary tests.
tests/cases/MySqlMultiQueryParserTest.phpt Adds MySQL hash-comment preservation coverage and constructor option support.
tests/cases/PostgreSqlMultiQueryParserTest.phpt Passes the new option through test parser creation.
tests/cases/SqlServerMultiQueryParserTest.phpt Passes the new option through test parser creation.
tests/cases/SqliteMultiQueryParserTest.phpt Passes the new option through test parser creation.
readme.md Documents preserveLeadingComments usage and behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@JanTvrdik JanTvrdik requested a review from hrach May 29, 2026 12:36
@JanTvrdik
Copy link
Copy Markdown
Member Author

Another option would be to consider the previous behavior where comments were stripped a bug, and and make this simple into bug fix without the extra option.

@hrach
Copy link
Copy Markdown
Member

hrach commented May 30, 2026

The PR description is missing the most important part -> the motivation. What's the use case? Why is 'preserveLeadingComments' needed? Why don't trailing comments need to be preserved? Or maybe comments in between? (I guess, a comment before a comment won't get parsed).

@JanTvrdik
Copy link
Copy Markdown
Member Author

The SQL that is running is visible in observability tools. It's a common pattern to prepend a comment the query to make it visible in the observability tools (prepend vs append is important, because the full query is often not visible).

-- some explanation for why we need this alter
-- or why we think it's safe to run at all
ALTER something_complicated_and_risky;

The obvious problem is that given sequence query -> comment -> query, it's not obvious to which query the comment belongs.

SELECT 1; -- hello
SELECT 2; -- world

vs

-- hello
SELECT 1;

-- world 
SELECT 2;

This PR assumes that the second case is far more common and that it's not worth trying to fine tune the heuristics to support the first case as well. And as consequence the trailing comments are not supported, since there is no query to attach them to.

Or maybe comments in between

Comment in between is a comment before query. Multiple comments before a query are supported.

@hrach
Copy link
Copy Markdown
Member

hrach commented Jun 1, 2026

If we produce [sql, comment, sql, comment] it can be easily processed on the user side as [sql, comment+sql] - without any assumptions on the structure.

Isn't that a bit more universal? We may bundle such an iterator as well 🤔

@JanTvrdik
Copy link
Copy Markdown
Member Author

That would be a major BC break. And you can't actually process this on user-side any better than this PR does it, unless we provide additional metadata such as line number and offset.

That should all be possible but it would be much larger in scope and require a new major version.

@hrach
Copy link
Copy Markdown
Member

hrach commented Jun 1, 2026

That would be a major BC break

Well, first we assess the arch & design, then the means to achieve BC, right?

E.g., bool $includeComments = false is pretty much the same way of change producing the mentioned output.

unless we provide additional metadata such as line number and offset

Could you give an example when this is required to parse it correctly then?

@hrach
Copy link
Copy Markdown
Member

hrach commented Jun 1, 2026

The problematic part on the user side is the detection if something is a comment, isn't it?

@JanTvrdik
Copy link
Copy Markdown
Member Author

Could you give an example when this is required to parse it correctly then?

If you want to resolve to which query a comment belong, you need the positional information to determine if the comment started on the same line as the previous query. Or if it started on a new line.

The problematic part on the user side is the detection if something is a comment, isn't it?

Yes. That's why I assume it is BC break, because you need to wrap the strings with Query and Comment values classes.

@JanTvrdik
Copy link
Copy Markdown
Member Author

JanTvrdik commented Jun 1, 2026

Possible BC solution, but not pretty

	/**
	 * @param  positive-int $chunkSize
	 * @return Iterator<($includeComments is true ? Query|Comment : string)>
	 * @throws RuntimeException
	 */
	public function parseFile(
		string $path,
		int $chunkSize = self::DEFAULT_CHUNK_SIZE,
		bool $includeComments = false,
	): Iterator;

or

	/**
	 * @param  positive-int $chunkSize
	 * @return Iterator<Query|Comment>
	 * @throws RuntimeException
	 */
	public function parseFileWithComments(
		string $path,
		int $chunkSize = self::DEFAULT_CHUNK_SIZE,
	): Iterator;

I prefer the solution in this PR to both of these.

@hrach
Copy link
Copy Markdown
Member

hrach commented Jun 1, 2026

It seems you are focused on the API and BC, and not on the merit of what functionality we want to offer. So let me switch to this approach as well :)

My Kotlin-like mind would offer this.
Yes, it is only binary compatible (aka runtime compatible), not source-compatible.

https://phpstan.org/r/70b27277-20ff-446e-9790-22dbccc77392

/**
 * @template O
 */
interface Mapper {
	/**
	 * @param Iterator<int, Query|Comment> $iterator
	 * @return Iterator<int, O>
	 */
	function map(Iterator $iterator): Iterator;
}

/**
 * @template O
 * @implements IMultiQueryParser<O>
 */
class BaseMultiQueryParser implements IMultiQueryParser {
	/**
	 * @param Mapper<O> $mapper
	 */
	function __construct(
		private Mapper $mapper = new QueryOnlyMapper(),
	) {}

	/**
	 * @param  Iterator<string> $stream
	 * @return Iterator<O>
	 */
	abstract public function parseStringStream(Iterator $stream): Iterator;
}

EDIT: PHPStorm type inference sucks, tho.

@hrach
Copy link
Copy Markdown
Member

hrach commented Jun 1, 2026

Yes, basically, I'm trying to come up with a solution that won't hardcode the very custom non-standardized behavior attaching comments to the following queries.

Obviously, it makes very much sense for your company/use case. Yet, somebody may come and tell that the separator cannot be two empty lines, because then the meaning of the comment is not specific to the next query. Those POV are similarly valid as many another approchaes. That makes me feel I would abstract it differently.

@JanTvrdik
Copy link
Copy Markdown
Member Author

JanTvrdik commented Jun 1, 2026

If we provide custom inner callback / interface, then we don't need to change the final type from Iterable<string>, so that would I think solve everything.


Implemented as CommentStrategy interface.

JanTvrdik added 5 commits June 1, 2026 22:22
Parsers now tokenize input into a neutral Query|Comment fragment stream;
a pluggable CommentStrategy collapses it into the final query strings.
The default DropComments reproduces the historical comment-stripping
output, while KeepLeadingComments prepends a query's leading comments.
This keeps the opinionated comment-handling policy out of the parser core
and leaves the public return type as Iterator<string>.

Co-Authored-By: Claude Code
The name now states the mechanism (the comment is prepended to the
following query, producing one combined string) and encodes the
prepend-vs-append distinction that motivates the feature.

Co-Authored-By: Claude Code
@JanTvrdik JanTvrdik changed the title Add preserveLeadingComments option to all parsers Add CommentStrategy to optionally preserve leading comments Jun 1, 2026
Copy link
Copy Markdown
Member

@hrach hrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 oh yes, this is nicely reasonable. not exposing the fragment type publicly, but simple enough to provide solution.

@JanTvrdik JanTvrdik merged commit 59363bc into main Jun 1, 2026
12 checks passed
@JanTvrdik JanTvrdik deleted the mysql-preserve-leading-comments branch June 1, 2026 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants